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ABSTRACT 



Standard maximum-likelihood estimators for binary-star and exoplanet ec- 
centricities are biased high, in the sense that the estimated eccentricity tends 
to be larger than the true eccentricity. As with most non-trivial observables, a 
simple histogram of estimated eccentricities is not a good estimate of the true 
eccentricity distribution. Here we develop and test a hierarchical probabilistic 
method for performing the relevant meta-analysis, that is, inferring the true ec- 
centricity distribution, taking as input the likelihood functions for the individual- 
star eccentricities, or samplings of the posterior probability distributions for the 
eccentricities (under a given, uninformative prior). The method is a simple im- 
plementation of a hierarchical Bayesian model; it can also be seen as a kind of 
heteroscedastic deconvolution. It can be applied to any quantity measured with 
finite precision — other orbital parameters, or indeed any astronomical measure- 
ments of any kind, including magnitudes, distances, or photometric redshifts — so 
long as the measurements have been communicated as a likelihood function or a 
posterior sampling. 

Subject headings: binaries: general — celestial mechanics, stellar dynamics - 
methods: data analysis — methods: statistical — planetary systems 



With rare exceptions, binary star and exoplanet science hinges not on the specific value 
of any individual eccentricity (or mass or period), but rather on the distribution, or the 
distribution as a function of stellar properties or other parameters. The goal of any statistical 
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study should be to determine the distribution of the quantity of interest — we will concentrate 
on orbital eccentricity for specificity — that would have been observed if the investigator had 
extremely high signal-to-noise data and some (magical) method for precise determination of 
all nuisance parameters, such as instrument calibration and system inclination and so forth. 
That is, the investigator wants the observational-uncertainty-deconvolved distribution of the 
quantity of interest. 

One exciting development in the study of binaries and exoplanets is that many groups 



are building probabilistic modeling software (Ford 2005 Gregory 2005 Balan & Lahav 



2009). Rather than fitting and returning a single set of parameters, These probabilistic 
packages (approximately) sample from the posterior probability distribution under weak- 
prior assumptions. These posterior samplings are much more useful than best-fit parameter 
values, because they permit subsequent investigators to perform probabilistic inference on 
the output without going back to the raw radial- velocity data while still properly propagating 
uncertainties. In this Article, we give an example of probabilistic meta-analysis that becomes 
possible when the parameter outputs for individual stars are probabilistic. 

For any plausibly exoplanet- or binary-hosting star n, there are parameters 

™n) , (!) 

where n n is the velocity amplitude, T n is the period, 4> n is some orbital phase or fiducial time, 
e n is the eccentricity, and zu n is the longitude of perihelion. Fitting to these parameters is 
non-linear and unbiased estimators are rare. For these reasons, maximum-likelihood (or, 
for Gaussian noise, minimum-^ 2 ) parameters are not in general unbiased estimators of the 
true parameters — maximum-likelihood estimators only become unbiased in the limit of an 
infinite amount of data. Almost all single-point estimates, including maximum- a-posteriori 



or median-of-sampling parameters, are also biased in general (Lehmann & Casella 1998) 



Indeed, along these lines, it has been shown that the eccentricity e n of any star n has this 
property: The maximum-likelihood estimate e n of the eccentricity is biased high; any his- 
togram of estimated eccentricities e n will have a mean (and variance) that is higher than that 
of the true distribution of true eccentricities e n ( Shen fc Turner|2008[ ). These results — and 



the incredible diversity of eccentricities observed in exoplanet systems — motivate a concen- 
tration on the eccentricity distribution in what follows. Without the analysis methods we 
propose here, it is possible that conclusions about the high eccentricities of exoplanet systems 
might be over-stated or distorted. 

There are three fundamental approaches to determination of the true eccentricity dis- 
tribution — or true distribution of any quantity — given noisy measurements e n , where n is an 
index over instances (in this case, stars with binary or exoplanet companions). The first (and 
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worst) is just to adopt the estimated values as good estimates of the true values and create 
a histogram (or other density estimate) of the observed values e n . Because these estimators 
are biased, and because they are noisy, the distribution created in this way will have the 
wrong mean and variance; it will be a strongly biased estimate of the true distribution of true 
eccentricities e n . Furthermore, adding new e n estimates from new stars n will not decrease 
these biases; there is no y/N improvement as new data are added. There are suggestions 



of less-biased estimators for eccentricity (Zakamska et al 2010), but anything unbiased in 



eccentricity e will still be biased for any nonlinear function of e, and point estimates still have 
the property that the distribution of point estimates will in general be different in variance 
from the true distribution, if only because of observational noise. 

The second approach is to deconvolve the distribution of maximum-likelihood or best-fit 
e n values. In this approach the investigator recognizes that the distribution of estimates e n 
is the true distribution convolved with the uncertainty distribution, where that uncertainty 
can be described as the probability of estimating e n when the true value is e n , or p(e n \e n ). 
The investigator finds the distribution of true values e n that, when it is convolved with the 
uncertainties, produces the distribution of estimates e n . Done correctly, this will be per- 
formed in a forward-modeling of the observed distribution, starting at the true distribution. 
This method is much more responsible, but when the investigator works at the distribution 
level (that is, not at the individual-star level), the investigator must assume things about the 
distribution of uncertainties, equivalent to assuming that all the stars have the same rela- 
tionship p(e n \e n ) between the estimated and true values. It is also a disadvantage that when 
performed naively, deconvolutions can be very sensitive to histogram binning (or, equiva- 
lently, choices about the density estimation) and are unstable to "ringing" and other issues 
coming from shot noise in the observed distribution. 

The third approach — and the approach taken here — is forward modeling of the het- 
eroscedastic observed data (or eccentricity estimates, if these estimates are not single point 
estimates but rather posterior samplings). That is, the investigator makes a non-parametric 
(or highly parameterized) model of the frequency distribution function f a (e) for the true 
e n values, and finds the best-fit values of the distribution parameters a — the values of the 
parameters ex. that, after convolution with the (suitably transformed) distributions in the 
nuisance parameters, explains best the full set of eccentricity samplings. This is also es- 
sentially deconvolution, but it has the enormous advantage over naive deconvolution that it 
accounts for the fact that different stars have different levels of (and functional forms for) 
uncertainty in the nuisance parameters. This forward-modeling approach is slowly being 



adopted in astrophysics (an early proponent is Loredo 2004); we used a simple version of it 



to estimate the galaxy luminosity function ( Blanton et q/|[2003 ) and the velocity distribution 



in the Galaxy disk (Bovy et al 2009a), and we built a general tool for situations in which 
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distributions are smooth and observational uncertainties are simple (Bovy et al 2009b) 



Here we perform this forward modeling for a situation in which the observational un- 
certainties are not simple (uncertainties are asymmetric; measurements are biased) in an 
area (exoplanet eccentricities) of great current scientific interest. Everything that follows 
is straightforwardly generalized to other parameters and other kinds of systems. For ex- 
ample, parallax-based stellar distances, photometric redshifts, and faint-source fluxes also 



suffer from systematic biases (Lutz & Kelker 1973 Connolly et al 1995 Hogg & Turner 



1998). Scientific results based on these measurements rely on the true distributions, not 



the distributions of (biased) measurements, and in all of these cases, the objects of greatest 
interest have measurements at relatively low precision or low signal-to-noise. Reliable sci- 
entific results can be obtained nonetheless, though only by modeling the data; that is the 
fundamental motivation for this work. 



2. Method 

There are N stars n (1 < n < N), each of which has some number M n of radial velocity 
measurements v n j. For each star n, the set of measurements (data) 

D n = K;}£ n i (2) 

is modeled as being affected by a single companion. We are not explicitly considering multiple 
companion stars or planets at this stage, although the generalization is straightforward. The 
model is 

Vnj = V n + g n (t n j) + E nj , (3) 

where V n is an overall system velocity, the function g(t n j) is the radial velocity equation, and 
the E n j are noise contributions drawn from a Gaussian of zero mean and variance [o^- + S%\ , 
where is the uncertainty variance for the jth observation of star n and S% is a noise 
variance from intrinsic stellar variability and other unmodeled sources of noise. The radial 
velocity equation g n {t) for star n is parameterized by velocity amplitude n n , period T n , 
orbital phase <p n , eccentricity e n , and longitude of perihelion w n . (five parameters). This 
model of one star has 7 parameters (V n , S n , and five orbit parameters per star), which we can 
think of as living in a list oj n , and the model of al N stars has [7 N] continuous parameters 
in a bigger list 

The likelihood S£ n for the seven parameters u) n for star n is just the probability of the 
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data D n for star n given the parameters ui n for star n 
£? n = p(D n \u> n ) 

-2 1nX = Q + E + ^) + E ^ it ~ ' ( 4 ) 

where Q is some constant. This looks like x 2 but is modified for the jitter parameter S„. 

For each system n we imagine that we have been provided (by the exoplanet observing 
or fitting team, say) not the original data, but just a fT-element sample from a posterior 
probability distribution function (posterior PDF) created from the likelihood and an unin- 
formative prior PDF p (u> n ): 

p(ut n \D n ) = -^rp{D n \u n )pQ(u n ) , (5) 

where Z n is a normalization constant (for our purposes). The prior PDF Po(u) n ) will be 
decided not by us but by the exoplanet-fitter; we expect (need) it to be uninformative, for 
example, flat in all parameters, or in their logarithms. For each star n this sampling takes 
the form of a chain of K samples k, each of which is a set of 7 parameters u> n k, such that 
the distribution of the samples is consistent with a random draw from the posterior PDF. 

The total likelihood for all the parameters of all the stars n is just the product of 
the individual-star likelihoods 

se = p{{D n } N n= AM N n=1 ) 

N 

71=1 

This product formulation of the total likelihood makes the implicit assumption that the dif- 
ferent star observations are independent; that is, we are assuming that there are no likelihood 
covariances among the parameters of different systems n. That assumption will be at least 
weakly violated in any real survey of binaries or exoplanets, because different observations 
will share hardware issues and calibration information. 

We want, however, not the likelihood for all the star and orbital parameters but instead 
the likelihood j£f a for the parameters a of the eccentricity distribution f a (e). This requires 
a slight re-thinking, because once we know the true eccentricity distribution, that is a better 
prior PDF to be using than the uninformative prior PDF p (u> n ). It is this process — opening 
up the prior PDF to modeling and putting what we want to infer in the place of the prior 
PDF from previous inferences — that makes the method hierarchical. To make this inference 
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we must change variables and integrate out the individual-star parameters which are — for 
our purposes — nuisance parameters. For the likelihood function J2? a , this change of variables 
and integration is 

2 a = p{{D n } N n=1 \cx) 

N 



^ a = II / duJ n p(D n \u) n )p(u) n \ai) 

n=l J 



p(w n \a) = , (7) 

Po{e n ) 

where the integrals are over the N 7-dimensional parameter spaces and we have multiplied 
the uninfomative prior PDF po{u n ) by a ratio of the eccentricity distribution we want to infer 
to its uninformative counterpart. This is a marginalized likelihood (or "marginal likelihood") 
because we have inserted a prior PDF for the nuisance parameters and integrated them out, 
but left the result in the dimensions of likelihood (probability of data given parameters). 

In multiplying the prior PDF by f a {e)/po{e) we have implicitly assumed that the true 
distribution of parameters is separable; that is, that both the uninformative prior PDF 
and the informative prior PDF parameterized by a can be written as a prior PDF on the 
eccentricity e multiplied by a prior PDF on the other parameters. This is not true in 
general, and is a limitation of this formulation. The limitation is not fundamental; f a (e) 
can be replaced with a multivariate distribution function in the general case. 

The [7 N] -dimensional integral (or product of N 7-dimensional integrals) looks intimi- 
dating, but that integration is exactly the capability that the sampling from each individual 
system n provides for us: Given a i^-element sampling with elements u> n k, 

r i K 

j du n p (uj n \D n ) F(u n ) « _£>(u; nfc ) ; (8) 
J fc=i 

where p () represents the posterior PDF under the uninformative prior PDF upon which the 
sampling is based. The point is that all probability integrals can be approximated as sums 
over samples. So the sampling approximation to the marginalized likelihood for the a. is just 



where all that is inside the sum is the ratio between the uninformative prior PDF (on which 
the sampling is based) and the new prior PDF that we want to infer, and the e n k are the K 
samples of each e n . This sampling approximation to the likelihood J£ a for the parameters a. 



can be optimized to obtain a maximum-likelihood true eccentricity distribution, or it can be 
multiplied by a prior PDF, normalized, and sampled to obtain a posterior PDF sampling for 
the parameters ex. Either way, it is the likelihood that non-trivially enters into our inference. 

The expression for the likelihood J2? Q in equation ^ is an importance-sampling ap- 
proximation to the ratio of Bayes factors (integrals of the posterior probability distribution 
over the nuisance parameters). The comparison of marginalized likelihoods of two models 
(between the default prior PDF and the distribution parameterized by ex or between two 
different values of the parameters a) is equivalent to a marginalized Bayesian comparison 
between two models. It is an importance sampling because it uses a sampling but re-weights 
the samples by the ratio of probabilities between the two models. The usual caveats con- 
cerning importance sampling apply here as well: Since the samples returned by the MCMC 
are generally not independent, the importance-sampling approximation does not improve as 
\/N and if the default prior PDF and the distribution parameterized by ex are very differ- 
ent, the importance-sampling approximation will be noisy. Therefore, we prefer the default 
prior PDF to be uninformative. We also need it to be uninformative to ensure that the pos- 
terior PDF generated with po(e) has support — and samples — wherever the posterior PDF 
generated with f a (e) is significant. 

From an inference perspective, it is more sensible to simultaneously infer ex and all 
the u) n for all the systems, and perform the marginalization on the joint inference. Here 
we use this importance-sampling approximation because by assumption we do not have 
the exoplanet data; we have only the i^-element samplings from the posterior PDFs (the 
posterior PDFs created with the uninformative prior PDF). 

All that remains is to choose functional forms for (parameterizations of) the eccentricity 
distribution function f a (e), and a prior PDF on the parameters cx (if we want to perform 
sampling or further marginalization, which we do). There are many possible choices here, 
and a true Bayesian doesn't choose but rather does them all and includes them all in the 
output. However, for specificity and clarity, we consider only two forms for the eccentricity 
distribution. The first is a step function with M steps: 



M 



/ct(e) = ^ exp(q m ) s(e 



. m— 1 m 

' M ' M 





1 



for x < L 

for L < x < H 

for H < x 



M 




1 



(10) 
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where we have laboriously defined the step function as a mixture of top-hats and given the 
normalization constraint on the elements a m of the parameter vector ex. For the prior PDF 
on the a we use 

M M 

p(a) oc 5(1 - J^expa m ) exp(--e J^[a m - a m _!] 2 ) , (11) 



m=l 



where the delta function ensures the normalization of f a (e), and e is a control parameter 
that controls our expectation that f a (e) be smooth. Of course this smoothness parameter 
e — and the bin number M — should be learned along with ex., but as this is beyond our scope, 
we simply set e = 2 and M = 20. Empirically, this keeps the distributions smooth when the 
data sets get small, but permits good freedom and doesn't influence the results much for large 



data sets, as we show below. This is a simple smoothness prior (Kitagawa & Gersch 1996); 



more sophisticated versions can employ a Gaussian process ( Rasmussen fc Williams||2006 ) on 



the a m — the prior in equation (11) is a special case of this — and the hyper-parameters (only 
e in this case) controlling the smoothness of the distribution can be marginalized out. This 
has the advantage that the number of bins can be very large, but it has the disadvantage 



that sampling highly correlated bin heights is challenging (Murray et al 2010a[b ). 



The second form we consider for the eccentricity distribution is that of the beta distri- 
bution 

r(a + b) [a _ 1} _ [b _ n 
fa[) T{a)T{b) [ ' 

ex = (a,b) , (12) 

where T(z) is the gamma function, and we are redefining the parameter list ex to contain 
the beta distribution shape parameters a and b. This distribution is defined on the interval 
< e < 1 and has remarkable freedom with only two parameters. We take the prior PDF 
on the parameters (a, b) to be flat in the allowed region a > and b > 0. Technically this 
prior PDF is improper, but the posterior PDF under any realistic data set is proper. 

In either case — step function or beta distribution — when we have a set of J samples eXj 
of the distribution-function parameter vector that are effectively samples from the posterior 
probability distribution, we can use that distribution of distributions or else marginalize out 
the parameters. The marginalized distribution (/ a (e)) is given by 

(/"(e)> = 7 X>,(e) , (13) 

J 3=1 

where by u f a j(e) v we mean the distribution f a (e) made using parameters exj of the jth 
posterior sample. 
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3. Experiments 

We generated N = 400 ersatz radial velocity data sets n, each of which contained 
M n = 30 radial velocity measurements scattered uniformly over a time baseline of 1000 days. 
These measurements were created using the model for the radial velocity of the parent star 
of a theoretical exoplanet (or companion star in a binary system) described by equation (J3|. 
We assumed reasonable distributions for the governing parameters of the model, together 
with some (trivial) simplifications: 

Each ersatz parent or primary star was assumed to have a mass of M n = 1 M and 
vanishing overall system velocity (V n = 0) in the absence of its exoplanet (or stellar com- 
panion). Intrinsic radial velocity and primary star mass were, though, left free in fitting. 
Each ersatz exoplanet or companion star was assigned a mass m n drawn from a distribution 
p(m) oc l/m (flat in mm) constrained to lie in [0.1Mj up ] < m n < [10Mj up ]. Compan- 
ions were given orbital periods drawn from a distribution p(T) oc 1/T constrained to lie in 
[2d] < T < [2000 d]. Phases <p n and w n were drawn from uniform distributions < < 2tt, 
except for the inclination i n , which was drawn from a distribution flat in cost. The masses 
M n and m n and inclinations i n enter into the radial velocity model through the K n : 

[2w G] 1//3 m n sin i n 



T n 1/3 [M n + m n ] 2 /3 [l- e 2]V2 



(14) 



The eccentricity e n for each ersatz companion was drawn from one of two frequency 
distributions: The first is an intuitive distribution 



/(e) ' 



e 



(15) 



Z [[l + e] 4 2 4 _ 

(Shen & Turner 2008), where Z is a normalization constant; hereafter we call this distri- 
bution "ST4" . The second is an unrealistic straw-man designed to stress-test the inference 
methodology. It is a Gaussian with mean e = 0.3 and variance (0.05) 2 (but set to zero 
outside of the range < e < 1). 

The error added to each ersatz measurement was drawn from a Gaussian of known ob- 
servational noise variance, but with every point assigned its own individual (heteroscedastic) 
noise variance, uniformly distributed (in variance) in the interval (\/T0 ms -1 ) 2 < cr^- < 
(10 ms -1 ) 2 . For the ersatz observations, the jitter parameters S* 2 were set to vanish, but, as 
with the system velocity and mass of the parent (or primary) star, the jitter parameters S* 2 
were left free in the fitting. 



We optimized, fit, and marginalized the radial velocity model to each of the ersatz 
data sets using Metropolis-Hastings Markov Chain Monte Carlo (MCMC) sampling. For the 
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MCMC we adopted a standard uninformative prior PDF that is flat in ln/c n , flat in lnT n , 
flat in <j) n , flat in w n , and flat in e n . We performed the sampling not in the naive parameter 
space (In K n , lnT n , n , e n , vo n ) but in a better-behaved sampling space of 

(lnT n , n n cos[0 n + w n ],K n sin[0 n + w n ],e n cosw n , e n sin w n ) . (16) 

In this space, sampling is better behaved, but the "natural prior" PDF is not flat in In K n or 
e n , so a compensating prior PDF (or Jacobian) must be multiplied in. We confirmed that 
our sampler is using the correct uninformative prior PDF by running it on empty data sets; 
the resulting no-data samplings are samplings of the prior PDF. 

On each system n we set parameter step-sizes (for a Gaussian proposal distribution in 
the sampling space) such that acceptance ratios were near 0.4. On each system 10 6 links 
of MCMC were run, checked for mixing (convergence), and thinned (subsampled uniformly) 
to produce a set of K — 10 5 nearly independent parameter samples uj nk — independence is 
however not required for the sampling approximation of equation ^ to work. Radial ve- 
locity curves for two ersatz exoplanets are shown in Figure [TJ together with their MCMC 
samplings in period and eccentricity. For each system n we search the chain for the maximum- 
a-posteriori parameters. We also used a modified (simulated annealing) MCMC sampling 
to find the maximum-likelihood parameters uj n , one member of which is the "best-fit" ec- 
centricity e„. 

At this point we have the full NxK sampling e n k- This makes it possible to compute the 
marginalized likelihood of equation ^ for any parameters a. Again, we perform Metropolis- 



Hastings MCMC, but now in the space of ct with the prior PDF of equation (11). The 
proposal distribution used in the MCMC was a small Gaussian perturbation applied to 
every component of a., with Gaussian variance chosen to keep the acceptance ratio near 
0.4. For each experiment 10 4 links of MCMC were run, and checked for mixing. From this 
posterior sampling, we can obtain whatever results are desired, the maximum- a-posteriori 
distribution, the marginalized (mean-of-posterior) distribution, a sampling of distributions 
consistent with the data, or any quantile of the eccentricity distribution. 

In Figure [2] and Figure [3] we show the results of four experiments, two with N = 300 
systems and two with iV = 30, and two with the ST4 input eccentricity distribution, and 
two with the Gaussian input eccentricity distribution. We show the true (input) eccentric- 
ities, the maximum-likelihood eccentricities, and the inferred distribution. In each case, we 
find — as expected — that the marginalized (mean-of-posterior) inferred eccentricity distribu- 
tion is a far better description of the original input eccentricity distribution than the naive 
distribution created by histogramming the maximum-likelihood eccentricity estimates e n . 
The inferred distribution captures well the smooth distribution that was used to generate 
the ersatz data. Furthermore, where the actual finite sampling of true values departs (just 
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by Poisson statistics) from the smooth distribution, the inferred distribution even captures 
those deviations, especially when there are N = 300 systems (Figure [2]). 

Figure [2] and Figure [3] also show that the mean of the eccentricity distribution is over- 
estimated when the best-fit eccentricities e n are used to represent the eccentricity distribu- 
tion, and that the over-estimate does not decrease as the number of objects increases. How- 
ever, the marginalized inferred mean — the mean obtained by marginalizing over samples — is 
a very good estimate of the true mean of the true distribution. The same holds for all of the 
quantiles of the distribution. 

The sampling in ex provides full uncertainty information, including the uncertainty in 
every component of ex and also all of the component-component covariances. In Figure [2] 
and Figure [3] we show only a superimposed sampling. This conveys information about the 
uncertainty distribution in each bin, but it doesn't display the full power of the sampling for 
error analysis and propagation. 

We repeat these experiments but now using the beta distribution for f a (e). Once 
again we perform Metropolis-Hastings MCMC, but now in the space of the beta-distribution 
shape parameters ex = (a,b). Again the proposal distribution used in the MCMC was a 
small Gaussian perturbation applied to the two components of ex, with Gaussian variance 
chosen to keep the acceptance ratio near 0.4. For each experiment 2 x 10 3 links of MCMC 
were run, and checked for mixing. We show the results of the beta-distribution fitting in 
Figure |4} It also performs extremely well. 

We obtained K = 10 5 samples per star, but this is almost certainly overkill. In Figure [5] 
we show how the results change when this sampling is thinned down to just = 50 samples 
of the posterior PDF. The thinning was done uniformly, taking every 2000th sample from 
the parent set of 10 5 . The results are only slightly worse with K = 50; this speeds up the 
code by a factor of 2000, of course. 



4. Discussion 

We have shown that proper inference of the eccentricity distribution outperforms the 
naive approach of histogramming best-fit eccentricity values. Our inference proceeds by in- 
serting the model eccentricity distribution as a prior on the eccentricities, after which a good 
eccentricity distribution is one that makes the data — the set of all exoplanet observations — 
probable. Because this method models the distribution prior to observation, it is effectively 
a deconvolution working on heteroscedastic data; it is a generalization to arbitrarily non- 



Gaussian uncertainties of previous work in this area (Bovy et al 2009b). There is nothing 
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crucial about eccentricity; the method presented here can be applied to any problem in which 
the true distribution f(x) is desired and there are only noisy measurements. As long as the 
measurements are presented as likelihood functions or samplings under an uninformative 
prior, the function f(x) can be inferred as described here. 

Though it is common for investigators to present as measured distribution functions the 
histogram of estimated values, sometimes even worse mistakes are made. For example, it is 
sometimes tempting for an investigator to see the output of the sampler as providing a better 
estimate of the distribution function. The thinking is "well, the object has some probability 
of being in each of these eccentricity bins, so I will add a bit into each bin on its behalf" . 
This thinking is wrong: It convolves the error-convolved distribution with the errors once 
again. We will not cite specific examples (to protect the guilty), but this is occasionally done 
and always incorrect. 

One limitation of this study is that we made a simplifying assumption of a separable 
prior; that is, we imagined that there was one eccentricity distribution for all of the stars in 
the sample. In reality, the eccentricity distribution will depend on parent star and exoplanet 
parameters, and there will be no clean separation of the eccentricity distribution. There are 
two reactions to this. The first is to live with the result, understanding that the distribution 
returned by the method is correct for the specific sample in hand, marginalizing over all other 
parameters. The second is to permit non-separable distribution functions. In this latter case, 
hierarchical modeling becomes even more necessary. All of the correlations between orbital 
parameters or dependences on the parent star's properties can be modeled on the prior level 
and fit as we did here. This situation is not substantially more complicated; it is just that the 
terms in the sum in the likelihood of equation ^ becomes a function of multiple parameters, 
and the parameterization of the function changes. 

Another limitation of this work is that we only considered single-planet systems. This 
was in part because we do not have a simple prior for generating realistic multiple-planet 
systems, but also because there is nothing fundamental that changes if we switch to multiple- 
planet systems. We also permitted additional "jitter" in the observations but did not in fact 
add jitter or any other kind of additional noise or data outliers. Again, addition of these 
things — and proper modeling of them — changes the individual-object likelihood functions 
and samplings, but does not change the procedure by which the eccentricity distribution is 
inferred. 

Finally, this study has some danger of "garbage in, garbage out:" we generated data 
in accord with our general expectations, and fit it in accord with those same expecta- 
tions. This is a limitation of all studies on artificial data, of course. However, we note 
that the parameterized eccentricity distributions that we fit are mixtures of step functions 
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and beta distributions; neither of these are related directly to — or even very appropriate for 
representing — either of the distributions we used to generate the ersatz data. Furthermore, 
the "uninformative" prior PDF we used in the exoplanet samplings were wrong, not just be- 
cause it got the eccentricity prior PDF wrong (the subject of this Article) but because they 
also got the velocity amplitude k prior PDF wrong: The ersatz exoplanets were generated 
with isotropic inclinations and a power-law distribution of masses m; this does not generate 
a power-law frequency distribution in k. That is, this method works well even when the 
priors in the other parameters are substantially wrong. And of course these prior PDFs can 
be inferred too, in a generalization. 

In Figure [2] and Figure [3j there are many objects whose eccentricities have been over- 
estimated by the maximum-likelihood method, some of them drastically. This arises natu- 
rally because the maximum-likelihood eccentricity tends to be higher than the true eccen- 
tricity, but it also arises a bit un-naturally because we have included in these figures some 
ersatz systems for which the exoplanet is not detected at significant signal-to-noise. That is, 
we threw in every system, even though at some periods, masses, and inclinations, the signal- 
to-noise is near or below unity. In most real experiments, these systems are removed (no-one 
measures the eccentricity distribution for undetected planets!). However, this method is not 
thrown off significantly by the low signal-to-noise systems. This is a good property of a 
proper probabilistic data analysis methodology: It is not moved around by the lowest signal- 
to-noise objects. Many popular methods in astrophysics do not have this simple property 
(necessary property, some would say). For example, in principal components analysis, the 
highest-noise systems often dominate the total data variance and therefore dominate the 
analysis. In general, measurements of variance (as measurements of distribution functions 
often are) can be thrown by noisy data. Forward models tend not to be. 

In the real world, an investigator might want to infer the eccentricity distribution, but 
use input from many different exoplanet observers, each of whom might have sampled their 
exoplanets with different uninformative priors po(e). This is not a problem; in the importance 
sampling of equation (|9]), the sum for each object n should make use of the uninformative 
prior po(e) used on object n. Also in the real world, an investigator might want to infer 
the eccentricity distribution using a mixture of measurements, some of which have been 
delivered as samplings, and some of which have been delivered just as maximum-likelihood 
estimates. Unfortunately these latter — maximum-likelihood estimates or any other single- 
point estimates — are nearly useless for modeling. 

The method presented here is a baby step towards a hierarchical Bayesian method. The 
method would be fully hierarchical if we went back to the objects and re-sampled them 
using the inferred prior, or, even better, inferred the prior and performed the individual- 
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object samplings simultaneously. That is, this method becomes fully hierarchical when the 
inferred distribution function is used to improve the individual-object estimates. When the 
measurements are given as likelihood functions, this should be the preferred method. 

Along those lines, we could have performed a less simple but faster hierarchical sampling: 
Rather than marginalizing over the individual eccentricities by summing over a (potentially 
large) number of eccentricity samples for each exoplanet, we can use the provided samplings 
of the individual eccentricities as the basis of an approach that samples both the parameters 
of the eccentricity distribution and the true eccentricities of the planets. This approach 
avoids the sum in equation It also returns updated posterior samples of the individual 
eccentricies using the better prior — the prior inferred from the population of planets. 

Briefly, this approach entails writing down the joint posterior probability of the individ- 
ual eccentricies and the parameters of the eccentricity distribution. Using Bayes's theorem, 
we can write this joint distribution as 

p({e n },a|{£>X =1 ) ex p{{D n } N n=l \{e n })p{{e n }\cx)p{cx). (17) 

By sampling from this posterior distribution, we simultaneously obtain samples of the ec- 
centricities of the individual planets and of the parameters of the eccentricity distribution. 
These eccentricity samples will then have used the better eccentricity prior parameterized by 
ck, rather than the uninformative original prior. For planets detected at low signal-to-noise, 
this leads to more realistic estimates of the eccentricity (see Figure [6]). 



We can re- write equation (17) as 



p({e n }\cx) 



p({e n },a\{D n }" =1 ) oc p({D n }^ =1 \{e n }) p ({e n }) ^ 777 W P( a ) 

The provided eccentricity chains give us samples of the distribution in square brackets. We 
can re-use these samples in a manner similar to that in equation ^ to sample from the joint 
distribution of {e n } and ex. Starting from initial ({e n } > ct^), first (1) Metropolis-sample 
{e n }^ +1 ^ from p({e n }| {D n }^ i=1 , qW) by sampling from 

P{{D n } N n=l \{e n })p,{{e n }) ■ (19) 

which can be done by just picking a random sample from the given eccentricity chains for 
the individual planets — and using the Metropolis-Hastings acceptance probability on 

p({e ri }( l+1 )|aW)p ({eJ«|« W ) 



p({e n }WaW)p ({e n }(*+i)|a«) 



(20) 



Using this acceptance probability ensures that the samples are "importance-resampled" ac- 
cording to the new prior parameterized by a. Then, (2) sample cx^ +1 ' from p(ct\{e n }^ +1 ^) 
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using any MCMC sampler. The resulting sampling of a. can then be used in exactly the 
same way as in Section [3] 

Just to demonstrate the power of the hierarchical approach, in Figure [6] we show a com- 
parison of the maximum-likelihood eccentricity estimates to the mean-of-sampling estimates 
made with the inferred prior. That is, we take the marginalized inferred distribution (/ Q (e)), 
re-weight the eccentricity samples with this inferred distribution, and produce as a point es- 
timate for each system n the mean of the re-weighted sampling. These mean-of-sampling 
estimates, made with a correctly informative prior, are — not surprisingly — better than the 
maximum-likelihood estimates. 

In general, hierarchical inference must become a standard tool in astrophysics going 
forward: Our science is fundamentally statistical, and the objects of greatest interest are 
measured always at the limits of instrumental sensitivity. Hierarchical methods are the right 
tools for these jobs. 
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Fig. 1. — Two example ersatz exoplanets, one with relatively high signal-to-noise and one 
with relatively low. The top panels show the heteroscedastic radial velocity data with the 
true radial velocity model overplotted as a solid grey line, and the the maximum-likelihood 
(best-fit) radial velocity model overplotted as a dashed black line. The middle and bottom 
panels show the distribution of periods and eccentricities in ten percent of the K = 10 5 
samples in the thinned chain from the MCMC. Again, the solid grey line shows the true 
value, and the dashed black line shows the maximimum-likelihood value. 
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Fig. 2. — True, maximum-likelihood, and inferred eccentricity distributions for two samples 
of 300 ersatz exoplanets. The top panels show — as dotted curves — the frequency distribu- 
tions from which the true eccentricities were drawn, the ST4 distribution on the left, and 
the Gaussian distribution on the right (see text for details). Superimposed is shown — as 
histograms — the obtained sampling of true eccentricities. The vertical lines indicate the 
means of the histogrammed distributions. The middle panels show — as histograms — the 
distributions of the maximum-likelihood (best-fit) eccentricities. Again, the vertical lines 
show the means of the histogrammed distributions. The bottom panels show a sampling — as 
a set of superimposed light histograms — of inferred distribution functions, drawn from the 
posterior PDF, and — as a solid line with dots — the marginalized inferred distribution (the 
mean of a 10 4 -point sampling). The vertical line shows the mean of the marginalized inferred 
distribution (the marginalized mean). The dotted curves are repeated in all panels to guide 
the eye. 
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Fig. 3.- 



Same as Figure [2] but for just the first 30 ersatz exoplanets. 
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Fig. 4. — On the left, similar to the bottom-left panel of Figure [2] (300 stars, ST4 distribu- 
tion), but using the beta distribution rather than the step function for f a (e). The true ST4 
distribution is shown with a dotted line, a sampling from the posterior PDF is shown with 
solid grey lines, and the marginalized inferred distribution (the mean of a 2 x 10 3 -point sam- 
pling) is shown with a solid black line. On the right, the same but similar to the bottom-left 
panel of Figure [3] (30 stars, ST4 distribution). 
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Fig. 5. — Same as the bottom row of Figure [2] but thinning the individual-exoplanet posterior 
PDF samplings down by a factor of 2000, from K = 10 5 to K = 50 samples per exoplanet. 
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Fig. 6. — Improving point estimates hierarchically. The top row shows a comparison of true 
eccentricities (for our ersatz exoplanets) with the maximum-likelihood estimates, for the 
ST4 (left) and Gaussian (right) true distributions. The bottom row shows the comparison 
of mean-of-sampling eccentricity estimates (see text), made using as prior PDFs on the 
eccentricity the (highly informative) marginalized inferred distributions of Figure [2j The 
horizontal line near 0.2 in the left figure and near 0.3 in the right figure comes from very low 
signal-to-noise systems (signal-to-noise at or below unity) for which the mean-of-sampling 
estimate ends up being very close to the mean of the prior PDF; this is the only reasonable 
mean estimate when the data are not informative. 



